03 Kiva Loans: Bivariate EDA

In this file, we begin to consider relationships between variables. We have two basic types of questions that we’re interested in:

  1. What are the relationships between loan initialization variables? Questions of this type will lead us toward understanding how loans vary from country to country, demographic to demographic, how descriptions relate to loan amounts, and so forth.
  2. What are the inter-relationships between loan initialization variables and variables down the pipeline? Questions of this type will lead us toward understanding how the features of the loan – amount, country, demographics – relate to how that loan is funded, disbursed, and repaid.

Given the negative values in the disbursal_timediff variable, however, we discussed cutting the supply chain in two and examining the “posting-to-funding” component separately from the “disbursing-to-repaying” component.

We also generated some data partitioning ideas at the end of our single-variable EDA, namely:

Let’s load the data and get started.

library(jsonlite)
library(ggplot2)
library(psych)
library(RColorBrewer)
library(countrycode)

loans = jsonlite::fromJSON("newpaid.json")

loans$posted_datetime = strptime(loans$posted_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$funded_datetime = strptime(loans$funded_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$disbursal_datetime = strptime(loans$disbursal_datetime, format = "%Y-%m-%d %H:%M:%S")
loans$paid_datetime = strptime(loans$paid_datetime, format = "%Y-%m-%d %H:%M:%S")

We know that we have no missing data, so we’ll dive right in.

Relationships among loan features

First, we’ll generate a scatterplot matrix for the numeric loan feature variables:

pairs.panels(loans[, c(8, 9, 17, 18)])

Most pairs of variables are not correlated. There is no relationship between the length of a loan’s description and its amount, the number of borrowers, or the gender composition of its borrowers, nor is the gender composition of the borrowers related to the amount or number of borrowers. (The apparent structure in the number-female plot is merely an artifact of integer denominators.) The only correlation is between the loan amount and the number of borrowers. Although many loans, regardless of amount, have a single borrower – perhaps a representative for a larger group – the general trend is that a larger loan means more borrowers.

We see a few high-leverage points, where the loan amount is greater than 5000 USD, but these do not seem to be throwing off the overall trends.

Next, we want to consider categorical features of the loan, country and sector, in relation to our other loan variables. Let’s look at sector and loan amount first:

ggplot(data=loans, aes(x=sector, y=terms.loan_amount, colour=sector)) + 
  geom_boxplot() + scale_colour_discrete(guide=FALSE) +
  xlab("Sector") + ylab("Loan amount") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

For the bulk of the loans, there isn’t much difference between sectors: the mass is concentrated below about 1500 USD. Sector distinctions seem to matter more when we look at the tails. The education, food, manufacturing, and service sectors all have multiple loans that extend well beyond 1500 USD. We can examine this segment of loans values by itself:

large_loans = loans[loans$terms.loan_amount >= 1500, ]
ggplot(data=large_loans, aes(x=sector, y=terms.loan_amount, colour=sector)) + 
  geom_boxplot() + scale_colour_discrete(guide=FALSE) +
  xlab("Sector") + ylab("Loan amount") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Even for this segment, though, the median and IQR’s by sector are comparable. Perhaps segmenting by loan amount won’t prove interesting after all.

How does the sector relate to the number of borrowers? Loans with one borrower will dominate the plot, so let’s look only at loans with more than one borrowers:

ggplot(data=loans[loans$b.num>1, ], aes(x=sector, y=b.num, colour=sector)) + 
  geom_boxplot() + scale_colour_discrete(guide=FALSE) +
  xlab("Sector") + ylab("Number of borrowers") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Again, the ranges are quite comparable. Entertainment sector is the only real stand-out.

Is there a relationship between gender and sector?

ggplot(data=loans, aes(x=sector, y=b.female, colour=sector)) + 
  geom_boxplot() + scale_colour_discrete(guide=FALSE) +
  xlab("Sector") + ylab("Fraction of females in borrowing group") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Yes, there are certain sectors where the borrowers are almost exclusive female: arts, clothing, food, and retail. This is the exception, however. All in all, sector doesn’t seem to reveal much when it comes to the characteristics of a loan. We can quantify its effect on the loan amount:

summary.lm(aov(terms.loan_amount ~ sector, loans))
## 
## Call:
## aov(formula = terms.loan_amount ~ sector, data = loans)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -862.9 -430.4 -192.4  187.1 9282.6 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          716.6243    13.7199  52.232  < 2e-16 ***
## sectorArts            95.4826    42.6292   2.240  0.02512 *  
## sectorClothing       152.0449    27.3213   5.565 2.67e-08 ***
## sectorConstruction    98.5481    44.4357   2.218  0.02659 *  
## sectorEducation      163.6535    77.0978   2.123  0.03380 *  
## sectorEntertainment  133.3757   180.4572   0.739  0.45986    
## sectorFood             0.7903    18.1300   0.044  0.96523    
## sectorHealth         182.1045    67.6630   2.691  0.00713 ** 
## sectorHousing         -8.4243    39.6186  -0.213  0.83162    
## sectorManufacturing   98.8635    52.1075   1.897  0.05781 .  
## sectorPersonal Use    46.9174    74.7284   0.628  0.53012    
## sectorRetail          17.2302    18.6082   0.926  0.35449    
## sectorServices       171.2642    25.7245   6.658 2.89e-11 ***
## sectorTransportation -36.1917    35.0791  -1.032  0.30222    
## sectorWholesale       31.9051   124.1945   0.257  0.79726    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 719.7 on 13764 degrees of freedom
## Multiple R-squared:  0.007289,   Adjusted R-squared:  0.006279 
## F-statistic: 7.219 on 14 and 13764 DF,  p-value: 3.424e-15

The very low adjusted R-squared means a negligible effect size.

What about country? Does it help explain the characteristics of a loan?

ggplot(data=loans, aes(x=location.country_code, y=terms.loan_amount, 
                       colour=location.country_code)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Country") + ylab("Loan amount") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Here, we see a significant variance in loan amount, within and across countries. Some countries have a mean loan amount of over 3000 USD (the U.S., notably), even as many have a mean of under 1000 USD. Even for these latter countries, however, there can be a large number of loans in the upper quartile. (These are the dots lying outside the boxplots.)

Let’s remove countries that don’t show up much in our sample before analyzing how country relates to other loans variables.

high_countries = table(loans$location.country_code) >= 30
hc_vec = high_countries[loans$location.country_code] == TRUE
hc_loans = loans[hc_vec, ] 

An ANOVA shows that country does a better job of explaining the variance in loan amount:

summary.lm(aov(terms.loan_amount ~ location.country_code, hc_loans))
## 
## Call:
## aov(formula = terms.loan_amount ~ location.country_code, data = hc_loans)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1651.1  -269.8   -97.5   151.5  4348.2 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               704.78      70.69   9.970  < 2e-16 ***
## location.country_codeAM  1246.28     110.58  11.271  < 2e-16 ***
## location.country_codeAZ   531.89      79.50   6.691 2.31e-11 ***
## location.country_codeBJ  -213.33      89.32  -2.388 0.016935 *  
## location.country_codeBO   563.67      76.85   7.335 2.35e-13 ***
## location.country_codeCM  -153.35     121.27  -1.265 0.206054    
## location.country_codeCO   -13.18      88.62  -0.149 0.881743    
## location.country_codeCR   231.38     105.19   2.200 0.027851 *  
## location.country_codeDO   870.02      87.72   9.919  < 2e-16 ***
## location.country_codeEC   115.05      76.80   1.498 0.134162    
## location.country_codeGH  -135.53      76.17  -1.779 0.075236 .  
## location.country_codeGT   288.79      99.26   2.910 0.003625 ** 
## location.country_codeHN  -127.86      88.89  -1.438 0.150358    
## location.country_codeID  -187.85      92.40  -2.033 0.042059 *  
## location.country_codeIQ  1545.22     118.07  13.088  < 2e-16 ***
## location.country_codeKE  -231.28      73.51  -3.146 0.001659 ** 
## location.country_codeKG   626.35      90.57   6.916 4.87e-12 ***
## location.country_codeKH    64.92      72.92   0.890 0.373341    
## location.country_codeLB   549.71      81.01   6.785 1.21e-11 ***
## location.country_codeLR  -260.60      98.25  -2.653 0.007998 ** 
## location.country_codeML   132.63      86.07   1.541 0.123326    
## location.country_codeMN   591.03      81.99   7.209 5.95e-13 ***
## location.country_codeMX   207.07      76.45   2.709 0.006767 ** 
## location.country_codeMZ  -190.80     103.71  -1.840 0.065845 .  
## location.country_codeNG   -95.35      77.40  -1.232 0.218008    
## location.country_codeNI  -157.57      73.74  -2.137 0.032638 *  
## location.country_codeNP  -468.35     121.27  -3.862 0.000113 ***
## location.country_codePE   -77.97      72.11  -1.081 0.279602    
## location.country_codePH  -403.84      71.89  -5.618 1.97e-08 ***
## location.country_codePK   182.14      82.10   2.219 0.026527 *  
## location.country_codePS  1165.03     106.81  10.907  < 2e-16 ***
## location.country_codePY   668.48      81.06   8.247  < 2e-16 ***
## location.country_codeQS  -332.23      84.96  -3.910 9.26e-05 ***
## location.country_codeRW   177.18      84.55   2.096 0.036127 *  
## location.country_codeSL   126.58      84.00   1.507 0.131865    
## location.country_codeSN   338.81      82.32   4.116 3.88e-05 ***
## location.country_codeSV   -46.79      80.05  -0.585 0.558848    
## location.country_codeTG    95.73      79.94   1.198 0.231092    
## location.country_codeTJ   236.75      75.35   3.142 0.001682 ** 
## location.country_codeTZ    56.71      78.67   0.721 0.471016    
## location.country_codeUA   541.20      94.36   5.736 9.92e-09 ***
## location.country_codeUG   413.66      75.53   5.477 4.41e-08 ***
## location.country_codeVN    80.09      82.10   0.976 0.329287    
## location.country_codeWS   -81.76      84.08  -0.972 0.330852    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 582.9 on 13502 degrees of freedom
## Multiple R-squared:  0.2405, Adjusted R-squared:  0.2381 
## F-statistic: 99.45 on 43 and 13502 DF,  p-value: < 2.2e-16

The adjusted R-squared is 0.23, much better than the same statistic for sector. (If we include all countries, it climbs to 0.32.)

How does country relate to number of borrowers?

ggplot(data=hc_loans, aes(x=location.country_code, y=b.num, 
                       colour=location.country_code)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Country") + ylab("Number of borrowers") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, loans with one borrower dominate, yet there are some countries where the number of borrowers tends to range more widely, such as Paraguay (PY).

Does country relate to the gender composition of the borrowers?

ggplot(data=hc_loans, aes(x=location.country_code, y=b.female, 
                       colour=location.country_code)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Country") + ylab("Fraction of females in borrower group") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

In many countries, the loans are domainated by female borrower groups, but in many others, there are male groups as well. The most interesting feature of the above plot is probably the countries where there are a large number of mixed-gender borrower groups: Cambodia (KH), Peru (PE), The Philippines (PH), and Tanzania (TZ).

Finally, we can look at how our two categorical variables, country and sector, relate to one another. In our univariate EDA, we saw that agriculture, food, and retail were the main sector represented in the loan data. Is this true for each country?

getPalette = colorRampPalette(brewer.pal(11, "Spectral"))

num_sectors = length(unique(hc_loans$sector))

ggplot(data=hc_loans, aes(x=location.country_code, fill=sector)) + 
  geom_histogram(position="fill") + 
  scale_fill_manual(values = getPalette(num_sectors)) + 
  xlab("Country") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Not exactly. Those are the three largest sectors, but their relative frequencies differ from country to country. In some countries, agriculture accounts for well over 50% of loans, or even 100% of loans; in others, agriculture loans are outnumbered by food and retail loans.

The number of sectors can make it hard to be sure of these frequencies, so let’s look at the same plot but only for these top three sectors:

top_sectors = c("Agriculture", "Food", "Retail")
top_sector_loans = hc_loans[hc_loans$sector %in% top_sectors, ]

ggplot(data=top_sector_loans, aes(x=location.country_code, fill=sector)) + 
  geom_histogram(position="fill") + 
  scale_fill_manual(values = getPalette(3)) + 
  xlab("Country") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5,size=6))

Now the distinctions are clear. This suggests that when we move on to multivariate analysis, we try conditioning on the borrower’s country before examining how sector relates to loan properties.

Time series

We have not yet looked at time series. Let’s do that now. We note that loans are not evenly spaced in time, so any time series analysis would quickly land us in complex territory. However, we can get a sense of the overall activity by looking the loan amount at the posted_datetime variable:

ggplot(data=loans, aes(x=posted_datetime, y=terms.loan_amount)) + geom_line() +
  xlab("Time") + ylab("Loan Amount") + theme_bw()

The sporadic early activity suggests we not use dates early than 2007. Let’s narrow the window and use a smoothing function to get a better sense of the trend:

loans07 = loans[loans$posted_datetime > strptime("2007-01-01", format="%Y-%m-%d"), ]

ggplot(data=loans07, aes(x=posted_datetime, y=terms.loan_amount)) + 
  geom_smooth(lwd=2, level=0.99) +
  xlab("Time") + ylab("Loan Amount") + theme_bw()
## geom_smooth: method="auto" and size of largest group is >=1000, so using gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the smoothing method.

Total loan amount is growing, as we would expect with a growing business. How do the number of borrowers change over time?

ggplot(data=loans07, aes(x=posted_datetime, y=b.num)) + 
  geom_smooth(lwd=2, level=0.99) +
  xlab("Time") + ylab("Number of borrowers") + theme_bw()

Generally, this value goes up. And what about the gender composition of the borrowers?

ggplot(data=loans07, aes(x=posted_datetime, y=b.female)) + 
  geom_smooth(lwd=2, level=0.99) +
  xlab("Time") + ylab("Fraction of females in borrower group") + theme_bw()

This value climbs for the first year – meaning that, on average, a greater fraction of people in the borrowing group are female – and then dips for the next three years.

How do loan amounts change over time for the top three sectors?

ggplot(data=loans07[loans07$sector %in% top_sectors, ],
       aes(x=posted_datetime, y=terms.loan_amount, colour=sector)) +
  geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Loan Amount") + 
    scale_colour_discrete(name="Sector") +
  theme_bw()

And finally, we ask: How do loan amounts change over time by country? To avoid cluttering the plot, we can look at only the top five countries, in terms of how many loans they received:

top_countries = sort(table(loans07$location.country_code), decreasing=T)[1:5]

ggplot(data=loans07[loans07$location.country_code %in% names(top_countries), ], 
       aes(x=posted_datetime, y=terms.loan_amount, colour=location.country_code)) + 
  geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Loan Amount") + 
  scale_colour_discrete(name="Country") + theme_bw()

Kenya had a spike in 2008 in which they took higher-value loans, while the per-loan value for other countries has been increasingly more stably – Nicaragua and Cambodia, in particular.

Relationships between loan and pipeline features

Let’s revisit our scatterplot matrix, only now we will add “pipeline features” – that is, the timediffs that represent the difference in days between the different stages in the loan pipeline. We can also add in the number of lenders.

pairs.panels(loans[, c(8, 9, 17, 18, 5, 14, 15, 16)])

Recall that we are separating the funding component of the pipeline from the repayment component. That noted, we observe that the loan amount and the funding time are correlated, albeit weakly; this is not surprising. What is somewhat more surprising is the correlation (again, weak) between funding time and gender. This correlation is negative, meaning that as the fraction of females in a borrowing group increases, the funding time tends to drop. There are similar trends, though even weaker, between the same variables and the repayment time.

The number of borrowers and the length of the description have no apparent effect on how quickly a loan is funded (or repaid). Lender count, meanwhile, is correlated with variables that one might expect: the loan amount, the number of borrowers, and the funding time. It is not correlated with the gender composition of the borrowing group. We can quickly see, too, that the dollar per lender is completely uncorrelated with gender:

cor(loans$b.female, loans$terms.loan_amount/loans$lender_count)
## [1] -0.01251265

If gender does have an impact on lenders’ behavior, it is not in how much they give, but rather on whether they give – and the more people that do give, the more quickly the loan is funded.

Now we ask how sector and country relate to pipeline variables. We’ll begin with sector and funding time:

ggplot(data=loans, aes(x=sector, y=funded_timediff, 
                       colour=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Funding time") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

The median funding time for all sector is quite low, with slightly larger IQR’s for sectors that one might expect: housing and transportation. The length of the each sector’s tail is the main distinguishing feature.

Does the number of lenders vary by sector?

ggplot(data=loans, aes(x=sector, y=lender_count, 
                       colour=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Number of lenders") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Not really. As we saw before, sector and loan amount aren’t very related, and since loan amount is reasonably correlated with the number of lenders, the lack of a relationship here isn’t surprising.

Does sector help explain any variance in repayment time?

ggplot(data=loans, aes(x=sector, y=paid_timediff, 
                       colour=sector)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Sector") + ylab("Repayment time") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, the answer is not really.

We move on to examining the relationship of the borrower’s country to the funding time.

ggplot(data=hc_loans[hc_loans$funded_timediff >= 0, ], 
       aes(x=location.country_code, y=funded_timediff, 
                       colour=location.country_code)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Country") + ylab("Funding time") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

There are a few countries whose median value is much higher than the global median, which is a mere 0.71 days. These counties include:

hc_loans_0 = hc_loans[hc_loans$funded_timediff >= 0, ]
long_fund = sort(tapply(hc_loans_0$funded_timediff, 
            hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(long_fund), origin="iso2c", destination="country.name")
##  [1] "Iraq"                            "Palestine, State of"            
##  [3] "Lebanon"                         "Armenia"                        
##  [5] "Colombia"                        "El Salvador"                    
##  [7] "Azerbaijan"                      "Bolivia, Plurinational State of"
##  [9] "Kyrgyzstan"                      "Tajikistan"

How does the number of lenders relate to the borrowing country?

ggplot(data=hc_loans, aes(x=location.country_code, y=lender_count, 
                       colour=location.country_code)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Country") + ylab("Number of lenders") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Again, we see a fair amount of variance in the medians. Which countries require the most lenders?

high_lend_ct = sort(tapply(hc_loans_0$lender_count, 
            hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(high_lend_ct), origin="iso2c", destination="country.name")
##  [1] "Iraq"                "Armenia"             "Palestine, State of"
##  [4] "Kyrgyzstan"          "Dominican Republic"  "Mongolia"           
##  [7] "Lebanon"             "Ukraine"             "Costa Rica"         
## [10] "Azerbaijan"

Some of these names appeared in the previous list, but others are new. If we look at the median loan amount by country, we can get a sense for why (some of) these countries require more lenders and more funding time:

high_loan_amt = sort(tapply(hc_loans_0$terms.loan_amount, 
            hc_loans_0$location.country_code, median), decreasing=T)[1:10]
countrycode(names(high_loan_amt), origin="iso2c", destination="country.name")
##  [1] "Iraq"                "Palestine, State of" "Armenia"            
##  [4] "Dominican Republic"  "Azerbaijan"          "Lebanon"            
##  [7] "Kyrgyzstan"          "Mongolia"            "Ukraine"            
## [10] "Costa Rica"

They have some of the biggest loans. (There are countries that have even larger loans, but fewer loans in number, such as the United States.)

Finally, we examine repayment time by country.

ggplot(data=hc_loans, aes(x=location.country_code, y=paid_timediff, 
                       colour=location.country_code)) + geom_boxplot() + 
  scale_colour_discrete(guide=FALSE) + 
  xlab("Country") + ylab("Repayment time") +
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=6))

Here, the disparities are quite significant. If we wish to predict how long it will take a borrower to pay back a loan, that borrower’s country could certainly be helpful.

How do these new pipeline variables relate to the calendar time? We can check.

ggplot(data=loans07,
       aes(x=posted_datetime, y=funded_timediff)) +
  geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Funding time") +
  theme_bw()

Funding time has been slowly increasing – but not by much. The maximum toward the end of record keeping could stand some investigation.

Do we see a corresponding increase in the repayment time?

ggplot(data=loans07,
       aes(x=posted_datetime, y=paid_timediff)) +
  geom_smooth(lwd=2, level=0.99) + xlab("Time") + ylab("Repayment time") + 
    scale_colour_discrete(name="Sector") +
  theme_bw()

From mid-2009 on, the answer is yes: repayment time is increasing. However, there is a different unexplained maximum toward the start of recording keeping.

Summary

Here is what we have learned:

  • Loan amount is correlated with some expected variables: the number of borrowers, the funding time, the lender count, and the repayment time.
  • The fraction of the borrowing group’s composition that is female is negatively correlated with funding time, albeit weakly.
  • Sector doesn’t tell us much about the characteristics of the loan – not its amount, not its number of borrowers – nor does it explain much about how long a loan will take to get funded or repaid. Some sectors tend to have borrowing groups that are female-only, while others are more mixed-gender.
  • The country of the borrower tell us much more about a loan’s amount. There is also greater variance within countries when it comes to gender and number of borrowers. When we condition on country, we also see a greater variance by sector than we do unconditionally.
  • Country also seems to account for some of the variance in the pipeline variables: number of lenders, funding time, and payback time. To some extent, this may be a conflation of country with loan amount, since these two are correlated.
  • Time series are complicated by the irregular observation intervals. That said, we see average loan amount and number of borrowers grow over time. Sector-wise, retail and agriculture loan amount grow in a fairly stable manner, food less so. Not all countries exhibit growth over the five-year period of the loan data set. Funding and repayment time both trend up, but both display curious maxima.

We also encountered some curiosities in the data that we should address before moving on. Specifically:

  • We will remove countries that don’t appear at least 30 times in the data.
  • We will remove loans that have a funding time less than 0.
  • We will remove high-leverage loan amounts – i.e. loan amounts greater than 6000.
  • We will remove loans that happened before 2007.
head(hc_loans)
##       id status         sector partner_id lender_count
## 1 334113   paid           Food        183           42
## 2 313179   paid Transportation        145           24
## 3 206002   paid         Retail        145           15
## 4 354610   paid      Education        113           15
## 5 226099   paid    Agriculture        109           10
## 6  78960   paid           Food         70           85
##   description.languages location.country_code terms.loan_amount
## 1                    en                    SL              1175
## 2                    en                    PH               600
## 3                    en                    PH               425
## 4                es, en                    GT               400
## 5                    en                    KH               500
## 6                    en                    PE              2725
##   description.avgchar     posted_datetime     funded_datetime
## 1                 690 2011-09-11 00:40:06 2011-09-29 01:06:26
## 2                 524 2011-07-04 15:20:03 2011-07-08 06:18:03
## 3                1637 2010-07-01 07:00:16 2010-07-15 17:11:28
## 4                 472 2011-11-05 23:50:06 2011-11-06 02:52:45
## 5                 731 2010-10-01 07:00:09 2010-10-02 16:54:39
## 6                1890 2008-12-12 16:30:14 2008-12-16 15:52:58
##         paid_datetime  disbursal_datetime funded_timediff
## 1 2012-09-15 09:19:24 2011-08-15 07:00:00         18.0183
## 2 2012-01-15 11:01:44 2011-06-13 07:00:00          3.6236
## 3 2010-12-15 15:08:24 2010-05-07 07:00:00         14.4244
## 4 2012-05-15 08:52:45 2011-10-10 07:00:00          0.1685
## 5 2011-05-15 10:07:34 2010-07-28 07:00:00          1.4128
## 6 2009-06-15 22:51:05 2008-11-30 08:00:00          3.9741
##   disbursed_timediff paid_timediff b.num b.female
## 1           -44.7545      397.0968     1        0
## 2           -24.9709      216.2095     1        1
## 3           -69.4246      222.3808     1        1
## 4           -26.8700      218.0783     1        1
## 5           -66.4130      291.1303     1        1
## 6           -16.3284      197.5771    13        1
hc_loans2 = hc_loans[hc_loans$funding_timediff >= 0, ]
hc_loans3 = hc_loans2[hc_loans$terms.loan_amount < 6000, ]
hc_loans4 = hc_loans3[hc_loans$posted_datetime > strptime("2007-01-01", format="%Y-%m-%d"), ]

hc_loans4_json = jsonlite::toJSON(hc_loans4)
write(hc_loans4_json, file="hc_loans4.json")